import os
import scipy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from time import time
from sklearn.compose import ColumnTransformer
from scipy.stats import norm
import scipy.stats as stats
import numpy as np
Regression Analysis of Human Breathing Frequency and Other Physical Parameters
ITB8814 Data Mining, Project
Author: Juri Lunin
Date: 17.05.2024
Introduction
This project investigates the dataset "Energy Expenditure of Human Physical Activity".
Dataset description
The dataset contains people's physical characteristics and information about their state during physical activity: Energy Expenditure of Human Physical Activity. File format: csv.
Two scientific articles were published alongside the dataset:
- Activity recognition using wearable sensors for tracking the elderly
- A recurrent neural network architecture to model physical activity energy expenditure in older people
Dataset attributes:
- ID - participant's ID
- trial_date - date and time when data collection started at ID level
- gender - male or female
- age - in years
- weight - in kg
- height - in cm
- bmi - Body mass index in kg/m²
- gaAnkle - TRUE if data from GENEActiv on the ankle exist, FALSE otherwise
- gaChest - TRUE if data from GENEActiv on the chest exist, FALSE otherwise
- gaWrist - TRUE if data from GENEActiv on the wrist exist, FALSE otherwise
- equivital - TRUE if data from Equivital exist, FALSE otherwise
- cosmed - TRUE if data from COSMED exist, FALSE otherwise
- EEm - Energy Expenditure per minute, in Kcal
- COSMEDset_row - the original indexes of COSMED data (used for merging)
- EEh - Energy Expenditure per hour, in Kcal
- EEtot - Total Kcal spent (reset between indoor and outdoor measurements)
- METS - Metabolic Equivalent per minute
- Rf - Respiratory Frequency (breaths/min)
- BR - Breath Rate
- VT - Tidal Volume in litre
- VE - Expiratory Minute Ventilation (litre/min)
- VO2 - Oxygen Uptake (ml/min)
- VCO2 - Carbon Dioxide production (ml/min)
- O2exp - Volume of O2 expired (ml/min)
- CO2exp - Volume of CO2 expired (ml/min)
- FeO2 - Averaged expiratory concentration of O2 (%)
- FeCO2 - Averaged expiratory concentration of CO2 (%)
- FiO2 - Fraction of inspired O2 (%)
- FiCO2 - Fraction of inspired CO2 (%)
- VE.VO2 - Ventilatory equivalent for O2
- VE.VCO2 - Ventilatory equivalent for CO2
- R - Respiratory Quotient
- Ti - Duration of Inspiration (seconds)
- Te - Duration of Expiration (seconds)
- Ttot - Duration of Total breathing cycle (seconds)
- VO2.HR - Oxygen pulse (ml/beat)
- HR - Heart Rate
- Qt - Cardiac output (litre/min)
- SV - Stroke volume (litre)
- original_activity_labels - true activity label as noted in the study protocol; NA if unknown
- predicted_activity_label - activity label predicted by the model from [1]; NA if unknown
Research objective
The variable to be predicted, i.e. the target, is Y = "BR" (breath rate).
Goal: find the best multi-variable regression model for predicting the target BR.
Tasks (working hypotheses)
- People with a lower breathing frequency are more active and breathe more deeply.
Methodology and course of the study
The following analysis methods were used in this project:
- Linear regression
- Polynomial regression
- Decision tree regression
- Random forest regression
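The four regression approaches listed above can be compared side by side with cross-validation. A minimal sketch on synthetic stand-in data (the project itself uses the EEHPA features and target BR; the hyperparameters shown here are illustrative assumptions, not the project's tuned values):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with a known linear structure
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "linear": Pipeline([("sc", StandardScaler()), ("m", LinearRegression())]),
    "poly2": Pipeline([("sc", StandardScaler()),
                       ("pf", PolynomialFeatures(degree=2)),
                       ("m", LinearRegression())]),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # R^2 is the default scoring for regressors
    print(f"{name:6s} mean R2 = {scores.mean():.3f}")
```

The same loop, run on the real `X_train_std`/`y_train`, is the backbone of the model comparison carried out below.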
_DATA_PATH = 'data/EEHPA.csv'
_SIHTTUNNUS_ = 'BR'
_GOAL_ = 'Find the best multi-argument regression model for predicting the target variable BR.'
_DROP_= ['age', 'weight', 'height', 'bmi', 'ID',
'gaAnkle', 'gaChest', 'gaWrist', 'equivital', 'cosmed', 'COSMEDset_row',
'trial_date',
'VE.VO2', 'VE.VCO2', 'R', 'FiO2', 'FiCO2',
'VO2', 'VCO2', 'EEm', 'EEh',
'Ti', 'Te', 'Ttot',
'Qt', 'SV',
'predicted_activity_label']
_OBJ_CAST_= []
_DROP_UNNAMED_ = True
The dataset has many attributes that do not materially affect the study results, so we drop them right away. Some features were identified as irrelevant during the relationship analysis.
We drop 'ID': it plays no role in the study.
We drop the boolean device-availability features 'gaAnkle', 'gaChest', 'gaWrist', 'equivital', 'cosmed': they are not important here and their values are almost always True.
We also drop 'COSMEDset_row' and 'trial_date': they play no role in the present study.
We drop 'FiO2', 'FiCO2' and 'Qt', 'SV': the relationship analysis showed their associations to be indeterminate.
The relationship analysis also revealed the following weak or trivially natural associations, which show no strong correlation and are not of much interest here; the clustermap analysis suggested they may even interfere:
We drop 'age', 'weight', 'height', 'bmi' as well as 'VE.VO2', 'VE.VCO2', 'R'. They could merit a separate study, but for now their correlations are too weak.
While fitting the linear regression model it turned out that 'Ti', 'Te', 'Ttot' have an excessively strong influence, which is only natural, and they are not linearly related to the other features. We drop 'Ti', 'Te', 'Ttot'.
We also drop 'VO2', 'VCO2', 'EEm', 'EEh': the spread of their coefficients is too large.
We drop 'predicted_activity_label', since it comes from another model's predictions and could introduce noise.
Data analysis
Reading the data
df = pd.read_csv(_DATA_PATH)
df
| | ID | trial_date | gender | age | weight | height | bmi | gaAnkle | gaChest | gaWrist | ... | R | Ti | Te | Ttot | VO2.HR | HR | Qt | SV | original_activity_labels | predicted_activity_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GOTOV05 | 08/02/2016 13:42 | female | 61.6000 | 68.6000 | 162 | 26.1000 | True | True | True | ... | 0.9846 | 0.9300 | 1.8600 | 2.7900 | 2.1250 | 102 | 0 | 0 | NaN | sitting |
| 1 | GOTOV05 | 08/02/2016 13:42 | female | 61.6000 | 68.6000 | 162 | 26.1000 | True | True | True | ... | 1.0035 | 1.2600 | 1.1800 | 2.4400 | 2.2403 | 103 | 0 | 0 | NaN | NaN |
| 2 | GOTOV05 | 08/02/2016 13:42 | female | 61.6000 | 68.6000 | 162 | 26.1000 | True | True | True | ... | 1.0399 | 0.9700 | 1.6900 | 2.6600 | 4.2051 | 104 | 0 | 0 | NaN | NaN |
| 3 | GOTOV05 | 08/02/2016 13:42 | female | 61.6000 | 68.6000 | 162 | 26.1000 | True | True | True | ... | 1.0635 | 0.9600 | 2.0400 | 3.0000 | 4.3329 | 106 | 0 | 0 | lyingDownRight | standing |
| 4 | GOTOV05 | 08/02/2016 13:42 | female | 61.6000 | 68.6000 | 162 | 26.1000 | True | True | True | ... | 1.0307 | 1.1500 | 1.5700 | 2.7200 | 2.6086 | 106 | 0 | 0 | lyingDownRight | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38718 | GOTOV36 | 30/05/2016 10:41 | female | 81.3000 | 72.0000 | 167 | 25.8000 | True | True | True | ... | 0.8145 | 1.0100 | 1.7400 | 2.7500 | 3.9375 | 77 | 0 | 0 | NaN | standing |
| 38719 | GOTOV36 | 30/05/2016 10:41 | female | 81.3000 | 72.0000 | 167 | 25.8000 | True | True | True | ... | 0.7833 | 1.1100 | 1.9100 | 3.0200 | 3.1757 | 77 | 0 | 0 | NaN | NaN |
| 38720 | GOTOV36 | 30/05/2016 10:41 | female | 81.3000 | 72.0000 | 167 | 25.8000 | True | True | True | ... | 0.7643 | 0.8300 | 1.4100 | 2.2400 | 7.3279 | 77 | 0 | 0 | NaN | NaN |
| 38721 | GOTOV36 | 30/05/2016 10:41 | female | 81.3000 | 72.0000 | 167 | 25.8000 | True | True | True | ... | 0.7948 | 1.2500 | 2.7700 | 4.0200 | 4.4365 | 77 | 0 | 0 | NaN | NaN |
| 38722 | GOTOV36 | 30/05/2016 10:41 | female | 81.3000 | 72.0000 | 167 | 25.8000 | True | True | True | ... | 0.7068 | 1.1300 | 1.2900 | 2.4200 | 4.7681 | 77 | 0 | 0 | NaN | NaN |
38723 rows × 41 columns
print(f"The dataset has \033[1m{df.shape[0]}\033[0m rows, characterized by \033[1m{df.shape[1]}\033[0m features.")
The dataset has 38723 rows, characterized by 41 features.
Dataset variables:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38723 entries, 0 to 38722
Data columns (total 41 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ID                        38723 non-null  object
 1   trial_date                38723 non-null  object
 2   gender                    38723 non-null  object
 3   age                       38723 non-null  float64
 4   weight                    38723 non-null  float64
 5   height                    38723 non-null  int64
 6   bmi                       38723 non-null  float64
 7   gaAnkle                   38723 non-null  bool
 8   gaChest                   38723 non-null  bool
 9   gaWrist                   38723 non-null  bool
 10  equivital                 38723 non-null  bool
 11  cosmed                    38723 non-null  bool
 12  EEm                       38723 non-null  float64
 13  COSMEDset_row             38723 non-null  int64
 14  EEh                       38723 non-null  float64
 15  EEtot                     38723 non-null  float64
 16  METS                      38723 non-null  float64
 17  Rf                        38723 non-null  float64
 18  BR                        38723 non-null  int64
 19  VT                        38723 non-null  float64
 20  VE                        38723 non-null  float64
 21  VO2                       38723 non-null  float64
 22  VCO2                      38723 non-null  float64
 23  O2exp                     38723 non-null  float64
 24  CO2exp                    38723 non-null  float64
 25  FeO2                      38723 non-null  float64
 26  FeCO2                     38723 non-null  float64
 27  FiO2                      38723 non-null  float64
 28  FiCO2                     38723 non-null  float64
 29  VE.VO2                    38723 non-null  float64
 30  VE.VCO2                   38723 non-null  float64
 31  R                         38723 non-null  float64
 32  Ti                        38723 non-null  float64
 33  Te                        38723 non-null  float64
 34  Ttot                      38723 non-null  float64
 35  VO2.HR                    38723 non-null  float64
 36  HR                        38723 non-null  int64
 37  Qt                        38723 non-null  int64
 38  SV                        38723 non-null  int64
 39  original_activity_labels  24452 non-null  object
 40  predicted_activity_label  11395 non-null  object
dtypes: bool(5), float64(25), int64(6), object(5)
memory usage: 10.8+ MB
print_prop_num_count = len(df.select_dtypes(exclude=object).columns)
print_prop_obj_count = len(df.select_dtypes(include=object).columns)
print(f"The dataset has \033[1m{print_prop_num_count}\033[0m numeric and \033[1m{print_prop_obj_count}\033[0m non-numeric variables")
The dataset has 36 numeric and 5 non-numeric variables
if len(_OBJ_CAST_) > 0:
    obj_cast_l = {}
    for i in _OBJ_CAST_:
        if i in df:
            obj_cast_l.update({i: str})
    df = df.astype(obj_cast_l)
    df.info()
if len(_OBJ_CAST_) > 0:
    print("Converting the following features to categorical:")
    print(f"\033[1m{[i for i in _OBJ_CAST_]}\033[0m")
    print_prop_num_count = len(df.select_dtypes(exclude=object).columns)
    print_prop_obj_count = len(df.select_dtypes(include=object).columns)
    print(f"The dataset now has \033[1m{print_prop_num_count}\033[0m numeric and \033[1m{print_prop_obj_count}\033[0m non-numeric variables")
Data cleaning
Checking for duplicates
duplicates = df.duplicated(keep='first').sum()
print(f"The dataset contains {duplicates} duplicate rows.")
The dataset contains 28 duplicate rows.
df.drop_duplicates(keep='first',inplace=True)
duplicates_new = df.duplicated(keep='first').sum()
print(f"After cleaning, the dataset contains {duplicates_new} duplicate rows.")
After cleaning, the dataset contains 0 duplicate rows.
Feature transformation
Checking for records with missing data
missing_values_validation = df.isna().sum()
missing_values_validation
ID                              0
trial_date                      0
gender                          0
age                             0
weight                          0
height                          0
bmi                             0
gaAnkle                         0
gaChest                         0
gaWrist                         0
equivital                       0
cosmed                          0
EEm                             0
COSMEDset_row                   0
EEh                             0
EEtot                           0
METS                            0
Rf                              0
BR                              0
VT                              0
VE                              0
VO2                             0
VCO2                            0
O2exp                           0
CO2exp                          0
FeO2                            0
FeCO2                           0
FiO2                            0
FiCO2                           0
VE.VO2                          0
VE.VCO2                         0
R                               0
Ti                              0
Te                              0
Ttot                            0
VO2.HR                          0
HR                              0
Qt                              0
SV                              0
original_activity_labels    14263
predicted_activity_label    27328
dtype: int64
Only the activity-label columns contain missing values; the remaining features have no 'N/A' entries.
print(f"The cleaned dataset has \033[1m{df.shape[0]}\033[0m rows, characterized by \033[1m{df.shape[1]}\033[0m features.")
The cleaned dataset has 38695 rows, characterized by 41 features.
Removing unnecessary features.
if len(_DROP_) > 0:
    prep_validated_drop_list = []
    for i in _DROP_:
        if i in df:
            prep_validated_drop_list.append(i)
    if len(prep_validated_drop_list) > 0:
        df.drop(columns=prep_validated_drop_list, inplace=True)
if _DROP_UNNAMED_:
    prep_empty_cols = df.columns.str.contains('unnamed', case=False)
    if prep_empty_cols.any():  # len(np.where(...)) is always 1, so test the boolean mask directly
        df.drop(df.columns[prep_empty_cols], axis=1, inplace=True)
df
| | gender | EEtot | METS | Rf | BR | VT | VE | O2exp | CO2exp | FeO2 | FeCO2 | VO2.HR | HR | original_activity_labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | 0.0000 | 0.9528 | 21.5054 | 92 | 0.4091 | 8.7978 | 73.1476 | 12.4463 | 17.8803 | 3.0424 | 2.1250 | 102 | NaN |
| 1 | female | 0.0097 | 1.0143 | 24.5902 | 90 | 0.4642 | 11.4144 | 85.4919 | 11.8342 | 18.4176 | 2.5495 | 2.2403 | 103 | NaN |
| 2 | female | 0.0940 | 1.9223 | 22.5564 | 84 | 0.8774 | 19.7902 | 159.3537 | 25.3007 | 18.1628 | 2.8837 | 4.2051 | 104 | NaN |
| 3 | female | 0.2080 | 2.0189 | 20.0000 | 85 | 0.9243 | 18.4858 | 164.5567 | 30.6057 | 17.8035 | 3.3113 | 4.3329 | 106 | lyingDownRight |
| 4 | female | 0.2703 | 1.2154 | 22.0588 | 89 | 0.5876 | 12.9624 | 107.3240 | 16.2210 | 18.2639 | 2.7604 | 2.6086 | 106 | lyingDownRight |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38718 | female | 85.4218 | 1.2031 | 21.8182 | 90 | 0.5650 | 12.3281 | 102.0947 | 13.8710 | 18.0686 | 2.4549 | 3.9375 | 77 | NaN |
| 38719 | female | 85.4871 | 0.9703 | 19.8675 | 92 | 0.5314 | 10.5573 | 96.9951 | 11.8312 | 18.2534 | 2.2265 | 3.1757 | 77 | NaN |
| 38720 | female | 85.5828 | 2.2391 | 26.7857 | 87 | 0.6436 | 17.2386 | 110.4581 | 19.6846 | 17.1632 | 3.0586 | 7.3279 | 77 | NaN |
| 38721 | female | 85.7260 | 1.3556 | 14.9254 | 92 | 0.7017 | 10.4733 | 120.3514 | 22.2344 | 17.1512 | 3.1686 | 4.4365 | 77 | NaN |
| 38722 | female | 86.1708 | 1.4569 | 24.7934 | 89 | 0.6028 | 14.9449 | 109.3362 | 12.8511 | 18.1387 | 2.1320 | 4.7681 | 77 | NaN |
38695 rows × 14 columns
print_prop_num_count = len(df.select_dtypes(exclude=object).columns)
print_prop_obj_count = len(df.select_dtypes(include=object).columns)
print(f"The dataset has \033[1m{print_prop_num_count}\033[0m numeric and \033[1m{print_prop_obj_count}\033[0m non-numeric variables")
The dataset has 12 numeric and 2 non-numeric variables
The dataset is now suitable for the models without further changes.
pd.set_option('display.float_format', lambda x: '%.2f' % x)
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| EEtot | 38695.00 | 73.08 | 52.11 | 0.00 | 34.37 | 63.59 | 100.59 | 291.44 |
| METS | 38695.00 | 2.85 | 1.81 | 0.00 | 1.47 | 2.39 | 3.87 | 15.23 |
| Rf | 38695.00 | 23.99 | 10.80 | 2.89 | 18.13 | 22.30 | 27.78 | 375.00 |
| BR | 38695.00 | 82.02 | 10.42 | 23.00 | 77.00 | 85.00 | 90.00 | 99.00 |
| VT | 38695.00 | 1.19 | 0.60 | 0.04 | 0.76 | 1.08 | 1.50 | 4.64 |
| VE | 38695.00 | 27.73 | 16.97 | 0.20 | 15.40 | 22.84 | 35.15 | 132.13 |
| O2exp | 38695.00 | 207.24 | 103.41 | 7.55 | 134.52 | 186.63 | 257.73 | 969.38 |
| CO2exp | 38695.00 | 39.00 | 23.83 | 0.00 | 21.42 | 33.96 | 51.19 | 167.76 |
| FeO2 | 38695.00 | 17.49 | 0.75 | 12.74 | 17.05 | 17.51 | 17.94 | 22.43 |
| FeCO2 | 38695.00 | 3.11 | 0.68 | 0.00 | 2.70 | 3.09 | 3.53 | 6.11 |
| VO2.HR | 38695.00 | 7.94 | 4.99 | 0.00 | 4.60 | 7.55 | 11.12 | 39.39 |
| HR | 38695.00 | 81.68 | 36.18 | 0.00 | 69.00 | 83.00 | 104.00 | 203.00 |
We visualize the distributions of the numeric variables. First we extract the numeric feature columns from the dataset:
feature_columns=df.drop(_SIHTTUNNUS_, axis=1).select_dtypes(exclude=object).columns
feature_columns
Index(['EEtot', 'METS', 'Rf', 'VT', 'VE', 'O2exp', 'CO2exp', 'FeO2', 'FeCO2',
'VO2.HR', 'HR'],
dtype='object')
Visualizing the distributions of the numeric features: histogram + box plot
fig, axs = plt.subplots(len(feature_columns),2,dpi=95,figsize=(15,30))
i = 0
for col in feature_columns:
    df[col].plot(kind='hist', ax=axs[i,0], title=col, color="steelblue")
    df[col].plot(kind='box', vert=False, ax=axs[i,1], title=col,
                 patch_artist=True,
                 boxprops=dict(facecolor="steelblue"),
                 medianprops=dict(color="red", linewidth=1.5)).set_yticklabels('')
    i += 1
fig.tight_layout()
plt.show()
Most numeric features are right-skewed. The box plots show that outliers are plentiful.
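The "plenty of outliers" observation can be quantified with the usual 1.5×IQR box-plot whisker rule. A minimal sketch on toy data (in the notebook, `df[feature_columns]` would be iterated instead of the synthetic series):

```python
import numpy as np
import pandas as pd

def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the box-plot whisker rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# Toy right-skewed sample standing in for a numeric feature such as Rf
rng = np.random.default_rng(0)
demo = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))
print("outliers:", iqr_outlier_count(demo))
```

Applying the function per column gives a numeric outlier count to accompany each box plot.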
Description of the non-numeric attributes
Summary statistics of the non-numeric features:
df.describe(include=[object]).T
| | count | unique | top | freq |
|---|---|---|---|---|
| gender | 38695 | 2 | male | 24786 |
| original_activity_labels | 24432 | 16 | cycling | 4212 |
The most frequent gender in this dataset is male and the most frequent activity is cycling.
Value frequency tables of the non-numeric features:
for column in df.select_dtypes(include=object).columns:
    print(column)
    print(df[column].value_counts().sort_index())
    print()
gender
gender
female    13909
male      24786
Name: count, dtype: int64

original_activity_labels
original_activity_labels
cycling            4212
dishwashing        1885
lyingDownLeft      1459
lyingDownRight     1307
sittingChair       1546
sittingCouch       1623
sittingSofa        1549
stakingShelves     1689
standing           1102
step                413
syncJumping         161
vacuumCleaning     1744
walkingFast        1997
walkingNormal      1883
walkingSlow        1694
walkingStairsUp     168
Name: count, dtype: int64
Cycling is the most frequent activity in this dataset.
Description of the target variable Y
print(f"Target variable: Y = \033[1m{_SIHTTUNNUS_}\033[0m.")
Target variable: Y = BR.
Visualizing the distribution of the target variable:
fig, axs = plt.subplots(1,2,dpi=95,figsize=(15,5))
df[_SIHTTUNNUS_].plot(kind='hist',ax=axs[0], title="{}".format(_SIHTTUNNUS_), color="steelblue")
df[_SIHTTUNNUS_].plot(kind='box',vert=False,ax=axs[1], title="{}".format(_SIHTTUNNUS_),
patch_artist = True,
boxprops = dict(facecolor = "steelblue"),
medianprops = dict(color = "red", linewidth = 1.5))
plt.show()
The target's histogram shows a left-skewed distribution. The models built later achieved sufficient performance, so no transformation of the target was required.
Relationship analysis
Relationships between numeric features
We visualize the pairwise dependencies between the numeric features
sns.pairplot(df.select_dtypes(exclude=object))
plt.show()
A linear dependence is clearly visible between the following feature pairs:
- BR (Breath Rate) and VE (Expiratory Minute Ventilation, litre/min): negative
- VT (Tidal Volume, litre) and O2exp (Volume of O2 expired, ml/min): positive
- VT (Tidal Volume, litre) and CO2exp (Volume of CO2 expired, ml/min): positive
- O2exp (Volume of O2 expired, ml/min) and CO2exp (Volume of CO2 expired, ml/min): positive
- FeO2 (Averaged expiratory concentration of O2, %) and FeCO2 (Averaged expiratory concentration of CO2, %): negative
The remaining feature pairs show weak dependence, large scatter, or a more complex dependence structure.
Correlation matrix
Correlation matrix of the numeric features:
df.select_dtypes(exclude=object).corr()
| | EEtot | METS | Rf | BR | VT | VE | O2exp | CO2exp | FeO2 | FeCO2 | VO2.HR | HR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EEtot | 1.00 | 0.46 | 0.18 | -0.51 | 0.43 | 0.55 | 0.44 | 0.40 | -0.02 | 0.16 | 0.35 | 0.32 |
| METS | 0.46 | 1.00 | 0.26 | -0.91 | 0.73 | 0.90 | 0.71 | 0.77 | -0.44 | 0.56 | 0.69 | 0.39 |
| Rf | 0.18 | 0.26 | 1.00 | -0.35 | -0.14 | 0.32 | -0.13 | -0.15 | 0.23 | -0.19 | 0.10 | 0.14 |
| BR | -0.51 | -0.91 | -0.35 | 1.00 | -0.73 | -0.97 | -0.74 | -0.67 | 0.14 | -0.31 | -0.63 | -0.47 |
| VT | 0.43 | 0.73 | -0.14 | -0.73 | 1.00 | 0.79 | 1.00 | 0.95 | -0.32 | 0.47 | 0.68 | 0.44 |
| VE | 0.55 | 0.90 | 0.32 | -0.97 | 0.79 | 1.00 | 0.80 | 0.73 | -0.15 | 0.31 | 0.71 | 0.50 |
| O2exp | 0.44 | 0.71 | -0.13 | -0.74 | 1.00 | 0.80 | 1.00 | 0.92 | -0.25 | 0.41 | 0.67 | 0.44 |
| CO2exp | 0.40 | 0.77 | -0.15 | -0.67 | 0.95 | 0.73 | 0.92 | 1.00 | -0.51 | 0.67 | 0.68 | 0.39 |
| FeO2 | -0.02 | -0.44 | 0.23 | 0.14 | -0.32 | -0.15 | -0.25 | -0.51 | 1.00 | -0.91 | -0.37 | -0.01 |
| FeCO2 | 0.16 | 0.56 | -0.19 | -0.31 | 0.47 | 0.31 | 0.41 | 0.67 | -0.91 | 1.00 | 0.42 | 0.12 |
| VO2.HR | 0.35 | 0.69 | 0.10 | -0.63 | 0.68 | 0.71 | 0.67 | 0.68 | -0.37 | 0.42 | 1.00 | 0.58 |
| HR | 0.32 | 0.39 | 0.14 | -0.47 | 0.44 | 0.50 | 0.44 | 0.39 | -0.01 | 0.12 | 0.58 | 1.00 |
Heatmap visualization of the correlation matrix. First we extract the numeric features from the dataset:
num_f=df.select_dtypes(exclude=object)
plt.figure(figsize=(16,16))
sns.heatmap(num_f.corr(), annot=True, fmt= '.2f')
plt.show()
Correlation overview using a diverging palette.
plt.figure(figsize=(16,16))
sns.set(font_scale=1.0)
hm = sns.heatmap(num_f.corr(),
cbar=True,
annot=True,
square=True,
fmt='.2f',
annot_kws={'size': 10},
yticklabels=num_f.columns,
xticklabels=num_f.columns,
cmap=sns.diverging_palette(10, 220, sep=30, n=256),
center=0.0)
plt.show()
The strongest relationships are between:
- BR and VE (-0.97)
- BR and METS (-0.91)
- FeCO2 and FeO2 (-0.91)
There are also several reasonably strong positive relationships.
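The strongest pairs listed above were read off the heatmap; they can also be extracted programmatically. A minimal sketch on a toy matrix (in the notebook, `num_f.corr()` would replace `demo`):

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr: pd.DataFrame, k: int = 3) -> pd.Series:
    """Return the k feature pairs with the largest |correlation|, each pair listed once."""
    mask = np.triu(np.ones(corr.shape, dtype=bool))  # hide the diagonal and the duplicate triangle
    flat = corr.mask(mask).stack()                   # long format: (row, col) -> r
    return flat.reindex(flat.abs().sort_values(ascending=False).index).head(k)

# Toy correlation matrix standing in for num_f.corr()
demo = pd.DataFrame(
    [[1.00, -0.97,  0.14],
     [-0.97, 1.00, -0.15],
     [ 0.14, -0.15, 1.00]],
    index=["BR", "VE", "FeO2"], columns=["BR", "VE", "FeO2"],
)
print(strongest_pairs(demo, k=2))
```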
A clustermap orders rows and columns so that similar, closely related columns are placed near each other on the diagram. Organizing the correlation matrix this way reveals groups of strongly inter-correlated features, which lets us draw conclusions about multicollinearity.
sns.set(font_scale=1.0)
km = sns.clustermap(num_f.corr(),
cbar=True,
annot=True,
fmt='.2f',
annot_kws={'size': 10},
yticklabels=num_f.columns,
xticklabels=num_f.columns,
cmap=sns.diverging_palette(10, 220, sep=30, n=256),
center=0.0)
plt.show()
Some features are grouped at the first level, but no first-level cluster has more than two members. This means there is no strong multicollinearity.
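The multicollinearity conclusion can be cross-checked numerically with variance inflation factors: for standardized features, the VIFs are simply the diagonal of the inverse correlation matrix. A sketch on toy data (in the notebook, the columns of `num_f` excluding BR would be passed in):

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factors via the diagonal of the inverse correlation matrix.
    VIF_j = 1 / (1 - R^2_j), where R^2_j regresses feature j on all the others."""
    inv = np.linalg.inv(df.corr().to_numpy())
    return pd.Series(np.diag(inv), index=df.columns)

# Toy data: x2 nearly duplicates x1, so both should get a large VIF
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=500),  # collinear with x1
    "x3": rng.normal(size=500),                  # independent
})
print(vif(demo).round(2))
```

A common rule of thumb flags VIF > 10 as strong multicollinearity, matching the "no strong multicollinearity" reading of the clustermap.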
We compute the Pearson correlation coefficients of the numeric features with the target, sorted in ascending order. This shows which features are most weakly or most strongly related to the target BR.
pd.set_option('display.float_format', lambda x: '%.4f' % x)
num_f.corrwith(df[_SIHTTUNNUS_]).sort_values()
VE       -0.9704
METS     -0.9097
O2exp    -0.7376
VT       -0.7272
CO2exp   -0.6652
VO2.HR   -0.6300
EEtot    -0.5134
HR       -0.4657
Rf       -0.3460
FeCO2    -0.3060
FeO2      0.1410
BR        1.0000
dtype: float64
We display this graphically.
plt.figure(dpi=130,figsize=(1,4))
sns.set(font_scale=0.8)
sns.heatmap(pd.DataFrame(num_f.corrwith(df[_SIHTTUNNUS_]).sort_values()), fmt='.2f',
annot=True, cmap=sns.diverging_palette(10, 220, sep=30, n=256),
center=0.0)
plt.show()
After cleaning and fitting the first linear regression model, the most influential features are:
- VE - Expiratory Minute Ventilation (litre/min)
- METS - Metabolic Equivalent per minute
- O2exp - Volume of O2 expired (ml/min)
- VT - Tidal Volume (litre)
- CO2exp - Volume of CO2 expired (ml/min)
- VO2.HR - Oxygen pulse (ml/beat)
It appears that deeper breathing, with a larger volume of exhaled oxygen, is most strongly associated with a decrease in breathing frequency. The metabolic equivalent likewise dampens the breathing frequency, which is an interesting relationship worth further study.
Relationship of the target with the non-numeric features
We compute and display the target's relationship with the non-numeric features using box plots.
categ_columns=df.select_dtypes(include=object).columns
fig, axs = plt.subplots(len(categ_columns),1,dpi=95,figsize=(15,25))
i = 0
for col in categ_columns:
    df.boxplot(
        column=[_SIHTTUNNUS_],
        by=col,
        ax=axs[i],
        patch_artist=True,
        boxprops=dict(facecolor="steelblue"),
        medianprops=dict(color="red", linewidth=1.5)
    )
    i += 1
plt.suptitle('')
fig.tight_layout()
plt.show()
The dependence on BR is fairly uniform across categories; of the original_activity_labels values, cycling and walkingFast are the most frequent, and their BR values are more widely spread than those of the other activities.
We examine how the target varies across the values of the categorical features, giving an overview of their distributions and statistical properties.
pd.set_option('display.float_format', lambda x: '%.2f' % x)
for col in categ_columns:
    print(df.groupby(col)[_SIHTTUNNUS_].describe())
    print()
count mean std min 25% 50% 75% max
gender
female 13909.00 82.22 10.11 34.00 78.00 85.00 90.00 99.00
male 24786.00 81.90 10.58 23.00 77.00 85.00 90.00 99.00
count mean std min 25% 50% 75% max
original_activity_labels
cycling 4212.00 64.75 11.82 23.00 57.00 64.00 73.00 99.00
dishwashing 1885.00 86.46 3.90 63.00 84.00 87.00 89.00 98.00
lyingDownLeft 1459.00 86.42 5.25 53.00 84.00 87.00 90.00 99.00
lyingDownRight 1307.00 90.19 3.40 62.00 89.00 91.00 92.00 98.00
sittingChair 1546.00 90.25 3.33 61.00 89.00 91.00 92.00 99.00
sittingCouch 1623.00 90.28 3.15 69.00 89.00 91.00 92.00 98.00
sittingSofa 1549.00 90.32 3.65 63.00 89.00 91.00 93.00 98.00
stakingShelves 1689.00 85.42 4.80 62.00 83.00 86.00 89.00 98.00
standing 1102.00 83.45 5.76 61.00 80.00 84.00 88.00 99.00
step 413.00 83.70 4.94 68.00 81.00 83.00 87.00 97.00
syncJumping 161.00 86.94 6.27 61.00 85.00 88.00 91.00 98.00
vacuumCleaning 1744.00 82.10 5.83 49.00 78.00 83.00 86.00 99.00
walkingFast 1997.00 72.11 8.75 45.00 67.00 72.00 77.00 99.00
walkingNormal 1883.00 76.98 7.32 43.00 72.00 77.00 82.00 98.00
walkingSlow 1694.00 82.11 5.78 62.00 78.00 82.00 86.00 98.00
walkingStairsUp 168.00 87.27 5.12 62.00 84.75 88.00 91.00 97.00
cycling is the most frequent category, yet its mean BR is the lowest, which is quite an interesting fact.
num_f=df.select_dtypes(exclude=object)
X = num_f.drop([_SIHTTUNNUS_],axis=1)
y = num_f[_SIHTTUNNUS_]
Splitting into training and test data
To check how the predictive model performs on new data, we split the dataset into training (X_train, y_train) and test (X_test, y_test) sets, 80% training and 20% test, using the train_test_split() function from the model_selection module.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Standardizing the data
Standardization reduces the undue influence of features that live on very different scales. It helps the model learn more effectively and reduces computation time. For example, O2exp differs from the other features by the range of its values.
pd.set_option('display.float_format', lambda x: '%.4f' % x)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
std_df = pd.DataFrame(X_train_std, columns=X.columns)
std_df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| EEtot | 30956.0000 | -0.0000 | 1.0000 | -1.4077 | -0.7433 | -0.1782 | 0.5320 | 4.1906 |
| METS | 30956.0000 | 0.0000 | 1.0000 | -1.5789 | -0.7663 | -0.2625 | 0.5628 | 6.8422 |
| Rf | 30956.0000 | -0.0000 | 1.0000 | -1.9714 | -0.5403 | -0.1544 | 0.3578 | 32.8557 |
| VT | 30956.0000 | -0.0000 | 1.0000 | -1.9071 | -0.7096 | -0.1964 | 0.5062 | 5.6983 |
| VE | 30956.0000 | -0.0000 | 1.0000 | -1.6217 | -0.7266 | -0.2861 | 0.4337 | 6.1408 |
| O2exp | 30956.0000 | 0.0000 | 1.0000 | -1.9263 | -0.7007 | -0.2029 | 0.4874 | 7.3464 |
| CO2exp | 30956.0000 | 0.0000 | 1.0000 | -1.6377 | -0.7359 | -0.2141 | 0.5081 | 5.3928 |
| FeO2 | 30956.0000 | -0.0000 | 1.0000 | -6.2942 | -0.5822 | 0.0304 | 0.5923 | 6.5488 |
| FeCO2 | 30956.0000 | 0.0000 | 1.0000 | -4.5911 | -0.6006 | -0.0221 | 0.6297 | 4.4391 |
| VO2.HR | 30956.0000 | 0.0000 | 1.0000 | -1.5906 | -0.6696 | -0.0786 | 0.6387 | 5.4592 |
| HR | 30956.0000 | -0.0000 | 1.0000 | -2.2608 | -0.3523 | 0.0350 | 0.6158 | 3.3541 |
The data are now standardized and the differences between feature value ranges are much smaller.
Building the model
We build a linear regression model using LinearRegression().
slr = LinearRegression()
slr.fit(X_train_std, y_train)
LinearRegression()
Testing the model
Model coefficients: each value in coef_ reflects that feature's influence on the model's predictions.
slr.coef_
array([ 0.21141921, -4.86764595, -0.16945791, 4.94370014, -6.65009999,
-7.21399124, 3.17396404, -0.77187586, -1.11605182, 1.19794119,
-0.43799687])
The list of coefficients:
for col_name, x_i in zip(X.columns, slr.coef_):
    print(col_name + "\t", round(x_i, 4))
EEtot	 0.2114
METS	 -4.8676
Rf	 -0.1695
VT	 4.9437
VE	 -6.6501
O2exp	 -7.214
CO2exp	 3.174
FeO2	 -0.7719
FeCO2	 -1.1161
VO2.HR	 1.1979
HR	 -0.438
Visualizing the model coefficients:
coefs = pd.DataFrame(slr.coef_, columns=["Coefficients"], index=X.columns)
coefs
| | Coefficients |
|---|---|
| EEtot | 0.2114 |
| METS | -4.8676 |
| Rf | -0.1695 |
| VT | 4.9437 |
| VE | -6.6501 |
| O2exp | -7.2140 |
| CO2exp | 3.1740 |
| FeO2 | -0.7719 |
| FeCO2 | -1.1161 |
| VO2.HR | 1.1979 |
| HR | -0.4380 |
The coefficients with the largest influence on the model's predictions are:
- O2exp
- VE
- VT
- METS
Plotting the coefficients:
coefs.plot(kind="barh", figsize=(9, 7))
plt.title("MLR model")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left=0.3)
We check the stability of the coefficients, i.e. how much they vary when the model is repeatedly refit:
cv_model = cross_validate(
    slr,
    X_train_std,
    y_train,
    cv=10,
    n_jobs=1,
    return_estimator=True  # keep each fold's fitted model so its coefficients can be inspected
)
coefs = pd.DataFrame(
    [est.coef_ for est in cv_model["estimator"]],  # per-fold coefficients, not the single slr fit
    columns=X.columns,
)
plt.figure(figsize=(9, 7))
sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5)
plt.axvline(x=0, color=".5")
plt.xlabel("Coefficient importance")
plt.title("Coefficient importance and its variability")
plt.subplots_adjust(left=0.3)
The coefficients are not balanced around zero. Their uneven and high variability indicates that the model is not stable: the coefficients are not consistent when the model is refit on different subsets of the data.
We estimate the linear regression model's accuracy with cross-validation, computing the mean R² over the folds, and report separate scores on the training and test data.
scores = cross_val_score(estimator=slr,
X=X_train_std,
y=y_train,
cv=10,
n_jobs=1)
print('CV mean R2 score: %.3f' % np.mean(scores), "+/- %.3f" % np.std(scores))
print('R2 score on training data: %.3f' % slr.score(X_train_std, y_train))
print('R2 score on test data: %.3f' % slr.score(X_test_std, y_test))
CV mean R2 score: 0.970 +/- 0.001
R2 score on training data: 0.970
R2 score on test data: 0.970
Computing the model's RMSE:
scores = cross_val_score(estimator=slr,
X=X_train_std,
y=y_train,
scoring = 'neg_mean_squared_error',
cv=10,
n_jobs=1)
print('CV mean RMSE: %.3f' % np.mean(np.sqrt(np.abs(scores))), "+/- %.3f" % np.std(np.sqrt(np.abs(scores))))
print('RMSE on training data: %.3f' % np.sqrt(mean_squared_error(y_train, slr.predict(X_train_std))))
print('RMSE on test data: %.3f' % np.sqrt(mean_squared_error(y_test, slr.predict(X_test_std))))
CV mean RMSE: 1.808 +/- 0.042
RMSE on training data: 1.807
RMSE on test data: 1.786
The model residuals, i.e. the errors:
residuals=y_train-slr.predict(X_train_std)
The standardized residuals:
std_residuals=residuals/np.std(residuals)
Model diagnostic plots:
fig, axs = plt.subplots(2,2,dpi=95,figsize=(15,15))
plt.style.use("seaborn-v0_8-whitegrid")
# Residual against fitted values
axs[0, 0].scatter(x=slr.predict(X_train_std), y=std_residuals)
axs[0, 0].axhline(y=0, color='red', linestyle='dashed')
axs[0, 0].set_xlabel('Fitted Values')
axs[0, 0].set_ylabel('Std. Residuals')
axs[0, 0].set_title('Residuals vs Fitted')
# normal qqplot
stats.probplot(std_residuals, plot=axs[0, 1])
#sm.qqplot(std_residuals, dist=stats.t, fit=True, line='45', c='#4C72B0',ax=axs[0, 1])
axs[0, 1].set_title('Normal Q-Q')
# Fitted values against actual values
axs[1, 0].scatter(x=y_train, y=slr.predict(X_train_std))
axs[1, 0].plot(y_train, y_train, color='red', linestyle='dashed')
axs[1, 0].set_xlabel('Actual Values')
axs[1, 0].set_ylabel('Fitted Values')
axs[1, 0].set_title('Fitted vs Actual')
# Histogram of std. residuals
axs[1, 1].hist(std_residuals, density=True)
x = np.linspace(min(std_residuals),max(std_residuals), 500)
axs[1, 1].plot(x, norm.pdf(x),color='red')
axs[1, 1].set_xlabel('Std. Residuals')
axs[1, 1].set_title('Std. Residuals Density Plot')
fig.tight_layout()
plt.show()
Residuals vs Fitted: the points are not randomly scattered around the y=0 line. The presence of some functional pattern suggests the model could be improved, e.g. by adding higher-degree components.
The target Y is left-skewed, so a log transform is likely to be of limited use here. The log transform is typically effective for right-skewed features, where it makes the distribution more symmetric; applied to a left-skewed feature it can lose important information or make the data harder to interpret.
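The claim above can be illustrated on synthetic data (a toy log-normal sample, not the project's target): a log transform symmetrizes a right-skewed distribution, which is exactly the situation where it helps.

```python
import numpy as np
from scipy.stats import skew

# Toy sketch: a log transform symmetrizes a right-skewed (log-normal)
# sample; for a left-skewed target it would not have this effect.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed

skew_before = skew(sample)
skew_after = skew(np.log(sample))  # log-normal -> approximately normal

print(f"skewness before: {skew_before:.2f}, after log: {skew_after:.2f}")
```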
Normal Q-Q plot: the closer the quantiles of the standardized residuals are to the quantiles of the standard normal distribution, the better.
Fitted vs Actual plot: the model fits the training data reasonably well, since the points lie close to the red y = x line.
The model predicts observations with different BR values fairly uniformly.
Linear regression model with categorical features¶
We encode the categorical features so that they can be used in the model. We build a new data structure in which each categorical feature is replaced by several indicator (dummy) variables, each set to 1 when the observation belongs to the corresponding category. The first category is dropped to avoid linear dependence among the dummy variables. The new data structure contains the categorical features replaced by dummy variables.
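A minimal sketch of this encoding on a toy frame (hypothetical data, not the project's dataframe):

```python
import pandas as pd

# One-hot encoding with the first category dropped: for a binary
# feature only one indicator column remains.
toy = pd.DataFrame({"gender": ["male", "female", "male"]})
dummies = pd.get_dummies(toy, drop_first=True)

print(dummies.columns.tolist())  # only 'gender_male' remains
```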
X_dummy = pd.get_dummies(data=df.drop([_SIHTTUNNUS_],axis=1), drop_first=True)
X_dummy.head()
X_dummy
| EEtot | METS | Rf | VT | VE | O2exp | CO2exp | FeO2 | FeCO2 | VO2.HR | ... | original_activity_labels_sittingSofa | original_activity_labels_stakingShelves | original_activity_labels_standing | original_activity_labels_step | original_activity_labels_syncJumping | original_activity_labels_vacuumCleaning | original_activity_labels_walkingFast | original_activity_labels_walkingNormal | original_activity_labels_walkingSlow | original_activity_labels_walkingStairsUp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0000 | 0.9528 | 21.5054 | 0.4091 | 8.7978 | 73.1476 | 12.4463 | 17.8803 | 3.0424 | 2.1250 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 0.0097 | 1.0143 | 24.5902 | 0.4642 | 11.4144 | 85.4919 | 11.8342 | 18.4176 | 2.5495 | 2.2403 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 0.0940 | 1.9223 | 22.5564 | 0.8774 | 19.7902 | 159.3537 | 25.3007 | 18.1628 | 2.8837 | 4.2051 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 0.2080 | 2.0189 | 20.0000 | 0.9243 | 18.4858 | 164.5567 | 30.6057 | 17.8035 | 3.3113 | 4.3329 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 0.2703 | 1.2154 | 22.0588 | 0.5876 | 12.9624 | 107.3240 | 16.2210 | 18.2639 | 2.7604 | 2.6086 | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38718 | 85.4218 | 1.2031 | 21.8182 | 0.5650 | 12.3281 | 102.0947 | 13.8710 | 18.0686 | 2.4549 | 3.9375 | ... | False | False | False | False | False | False | False | False | False | False |
| 38719 | 85.4871 | 0.9703 | 19.8675 | 0.5314 | 10.5573 | 96.9951 | 11.8312 | 18.2534 | 2.2265 | 3.1757 | ... | False | False | False | False | False | False | False | False | False | False |
| 38720 | 85.5828 | 2.2391 | 26.7857 | 0.6436 | 17.2386 | 110.4581 | 19.6846 | 17.1632 | 3.0586 | 7.3279 | ... | False | False | False | False | False | False | False | False | False | False |
| 38721 | 85.7260 | 1.3556 | 14.9254 | 0.7017 | 10.4733 | 120.3514 | 22.2344 | 17.1512 | 3.1686 | 4.4365 | ... | False | False | False | False | False | False | False | False | False | False |
| 38722 | 86.1708 | 1.4569 | 24.7934 | 0.6028 | 14.9449 | 109.3362 | 12.8511 | 18.1387 | 2.1320 | 4.7681 | ... | False | False | False | False | False | False | False | False | False | False |
38695 rows × 27 columns
Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_dummy, y, test_size=0.2, random_state=0)
Standardizing the data
We standardize the numerical predictors and then join them with the remaining predictors. First, we extract the numerical features:
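The standardization pattern used below fits the scaler on the training data only and reuses the same statistics for the test data, avoiding information leakage. A toy sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit the scaler on training data only; the test point is scaled with
# the TRAIN mean and standard deviation.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler().fit(train)   # mean = 2, std = sqrt(2/3)
train_std = scaler.transform(train)
test_std = scaler.transform(test)

print(train_std.ravel(), test_std.ravel())
```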
X_train_num = X_train[num_f.drop([_SIHTTUNNUS_],axis=1).columns]
X_train_num.head()
| EEtot | METS | Rf | VT | VE | O2exp | CO2exp | FeO2 | FeCO2 | VO2.HR | HR | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7985 | 120.9925 | 1.0065 | 9.9834 | 1.2768 | 12.7465 | 232.7153 | 31.6134 | 18.2268 | 2.4760 | 4.6235 | 64 |
| 16281 | 144.2626 | 3.4369 | 31.4136 | 0.9985 | 31.3650 | 174.7038 | 35.1855 | 17.4974 | 3.5240 | 6.5937 | 135 |
| 9594 | 103.5986 | 1.5816 | 18.2927 | 0.9607 | 17.5744 | 167.2611 | 29.5767 | 17.4098 | 3.0786 | 7.8152 | 68 |
| 9242 | 51.8238 | 2.3240 | 26.6667 | 0.9985 | 26.6258 | 174.7062 | 30.6985 | 17.4974 | 3.0746 | 11.8315 | 66 |
| 17936 | 129.4891 | 7.6673 | 32.2581 | 1.6683 | 53.8148 | 286.4388 | 65.4659 | 17.1699 | 3.9242 | 0.0000 | 0 |
X_test_num = X_test[num_f.drop([_SIHTTUNNUS_],axis=1).columns]
X_test_num.head()
| EEtot | METS | Rf | VT | VE | O2exp | CO2exp | FeO2 | FeCO2 | VO2.HR | HR | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 14331 | 60.2944 | 1.5900 | 22.3048 | 0.6506 | 14.5125 | 113.0974 | 16.8269 | 17.3824 | 2.5862 | 6.3275 | 73 |
| 3717 | 118.9672 | 3.1380 | 28.3019 | 1.1517 | 32.5956 | 197.4944 | 43.9670 | 17.1479 | 3.8175 | 10.5206 | 95 |
| 6326 | 46.2270 | 1.5039 | 22.1402 | 0.7110 | 15.7408 | 126.2795 | 19.9925 | 17.7618 | 2.8121 | 5.6145 | 75 |
| 33591 | 99.7225 | 1.6865 | 18.8679 | 0.8608 | 16.2412 | 153.3907 | 22.7434 | 17.8199 | 2.6422 | 5.6727 | 77 |
| 9741 | 129.1196 | 1.5356 | 20.6897 | 0.9097 | 18.8221 | 161.0398 | 27.0269 | 17.7018 | 2.9709 | 7.2668 | 71 |
Standardization (the scaler is fitted on the training data only):
sc.fit(X_train_num)
X_train_std = sc.transform(X_train_num)
X_test_std = sc.transform(X_test_num)
Joining the predictors:
X_train_num.columns
Index(['EEtot', 'METS', 'Rf', 'VT', 'VE', 'O2exp', 'CO2exp', 'FeO2', 'FeCO2',
'VO2.HR', 'HR'],
dtype='object')
X_train.columns
Index(['EEtot', 'METS', 'Rf', 'VT', 'VE', 'O2exp', 'CO2exp', 'FeO2', 'FeCO2',
'VO2.HR', 'HR', 'gender_male', 'original_activity_labels_dishwashing',
'original_activity_labels_lyingDownLeft',
'original_activity_labels_lyingDownRight',
'original_activity_labels_sittingChair',
'original_activity_labels_sittingCouch',
'original_activity_labels_sittingSofa',
'original_activity_labels_stakingShelves',
'original_activity_labels_standing', 'original_activity_labels_step',
'original_activity_labels_syncJumping',
'original_activity_labels_vacuumCleaning',
'original_activity_labels_walkingFast',
'original_activity_labels_walkingNormal',
'original_activity_labels_walkingSlow',
'original_activity_labels_walkingStairsUp'],
dtype='object')
dummy_col=X_train.columns[~X_train.columns.isin(X_train_num.columns)]
dummy_col
Index(['gender_male', 'original_activity_labels_dishwashing',
'original_activity_labels_lyingDownLeft',
'original_activity_labels_lyingDownRight',
'original_activity_labels_sittingChair',
'original_activity_labels_sittingCouch',
'original_activity_labels_sittingSofa',
'original_activity_labels_stakingShelves',
'original_activity_labels_standing', 'original_activity_labels_step',
'original_activity_labels_syncJumping',
'original_activity_labels_vacuumCleaning',
'original_activity_labels_walkingFast',
'original_activity_labels_walkingNormal',
'original_activity_labels_walkingSlow',
'original_activity_labels_walkingStairsUp'],
dtype='object')
X_train_std=pd.DataFrame(X_train_std, columns=X_train_num.columns).join(X_train[dummy_col].reset_index()).drop(['index'],axis=1)
X_train_std.head()
| EEtot | METS | Rf | VT | VE | O2exp | CO2exp | FeO2 | FeCO2 | VO2.HR | ... | original_activity_labels_sittingSofa | original_activity_labels_stakingShelves | original_activity_labels_standing | original_activity_labels_step | original_activity_labels_syncJumping | original_activity_labels_vacuumCleaning | original_activity_labels_walkingFast | original_activity_labels_walkingNormal | original_activity_labels_walkingSlow | original_activity_labels_walkingStairsUp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.9165 | -1.0224 | -1.3076 | 0.1345 | -0.8833 | 0.2415 | -0.3129 | 0.9779 | -0.9337 | -0.6636 | ... | False | False | False | False | False | True | False | False | False | False |
| 1 | 1.3635 | 0.3213 | 0.6981 | -0.3264 | 0.2122 | -0.3180 | -0.1632 | 0.0106 | 0.6142 | -0.2686 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 0.5823 | -0.7044 | -0.5299 | -0.3889 | -0.5992 | -0.3898 | -0.3982 | -0.1057 | -0.0438 | -0.0237 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | -0.4122 | -0.2940 | 0.2538 | -0.3264 | -0.0667 | -0.3180 | -0.3512 | 0.0106 | -0.0497 | 0.7816 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 1.0797 | 2.6602 | 0.7772 | 0.7827 | 1.5331 | 0.7596 | 1.1058 | -0.4238 | 1.2053 | -1.5906 | ... | False | False | False | False | False | False | False | False | False | False |
5 rows × 27 columns
X_test_std=pd.DataFrame(X_test_std, columns=X_test_num.columns).join(X_test[dummy_col].reset_index()).drop(['index'],axis=1)
X_test_std.head()
| EEtot | METS | Rf | VT | VE | O2exp | CO2exp | FeO2 | FeCO2 | VO2.HR | ... | original_activity_labels_sittingSofa | original_activity_labels_stakingShelves | original_activity_labels_standing | original_activity_labels_step | original_activity_labels_syncJumping | original_activity_labels_vacuumCleaning | original_activity_labels_walkingFast | original_activity_labels_walkingNormal | original_activity_labels_walkingSlow | original_activity_labels_walkingStairsUp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.2495 | -0.6998 | -0.1544 | -0.9023 | -0.7794 | -0.9122 | -0.9325 | -0.1420 | -0.7710 | -0.3220 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 0.8776 | 0.1560 | 0.4069 | -0.0726 | 0.2846 | -0.0982 | 0.2048 | -0.4530 | 1.0478 | 0.5188 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | -0.5197 | -0.7474 | -0.1698 | -0.8025 | -0.7071 | -0.7851 | -0.7999 | 0.3612 | -0.4374 | -0.4649 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 0.5079 | -0.6465 | -0.4761 | -0.5544 | -0.6777 | -0.5236 | -0.6846 | 0.4382 | -0.6883 | -0.4532 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 1.0726 | -0.7299 | -0.3056 | -0.4733 | -0.5258 | -0.4498 | -0.5051 | 0.2816 | -0.2029 | -0.1336 | ... | False | False | False | False | False | True | False | False | False | False |
5 rows × 27 columns
X_train_std.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| EEtot | 30956.0000 | -0.0000 | 1.0000 | -1.4077 | -0.7433 | -0.1782 | 0.5320 | 4.1906 |
| METS | 30956.0000 | 0.0000 | 1.0000 | -1.5789 | -0.7663 | -0.2625 | 0.5628 | 6.8422 |
| Rf | 30956.0000 | -0.0000 | 1.0000 | -1.9714 | -0.5403 | -0.1544 | 0.3578 | 32.8557 |
| VT | 30956.0000 | -0.0000 | 1.0000 | -1.9071 | -0.7096 | -0.1964 | 0.5062 | 5.6983 |
| VE | 30956.0000 | -0.0000 | 1.0000 | -1.6217 | -0.7266 | -0.2861 | 0.4337 | 6.1408 |
| O2exp | 30956.0000 | 0.0000 | 1.0000 | -1.9263 | -0.7007 | -0.2029 | 0.4874 | 7.3464 |
| CO2exp | 30956.0000 | 0.0000 | 1.0000 | -1.6377 | -0.7359 | -0.2141 | 0.5081 | 5.3928 |
| FeO2 | 30956.0000 | -0.0000 | 1.0000 | -6.2942 | -0.5822 | 0.0304 | 0.5923 | 6.5488 |
| FeCO2 | 30956.0000 | 0.0000 | 1.0000 | -4.5911 | -0.6006 | -0.0221 | 0.6297 | 4.4391 |
| VO2.HR | 30956.0000 | 0.0000 | 1.0000 | -1.5906 | -0.6696 | -0.0786 | 0.6387 | 5.4592 |
| HR | 30956.0000 | -0.0000 | 1.0000 | -2.2608 | -0.3523 | 0.0350 | 0.6158 | 3.3541 |
Fitting the model
slr = LinearRegression()
slr.fit(X_train_std, y_train)
LinearRegression()
Evaluating the model
for col_name, x_i in zip(X_train_std.columns, slr.coef_):
print(col_name + "\t", round(x_i, 4))
EEtot	0.3412
METS	-3.598
Rf	-0.1433
VT	-1.9545
VE	-7.7387
O2exp	-1.0557
CO2exp	3.7042
FeO2	-0.9021
FeCO2	-1.1345
VO2.HR	1.0252
HR	-0.431
gender_male	1.6887
original_activity_labels_dishwashing	-0.031
original_activity_labels_lyingDownLeft	-0.0819
original_activity_labels_lyingDownRight	0.1211
original_activity_labels_sittingChair	0.1296
original_activity_labels_sittingCouch	0.2812
original_activity_labels_sittingSofa	0.3217
original_activity_labels_stakingShelves	-0.0936
original_activity_labels_standing	0.679
original_activity_labels_step	0.5068
original_activity_labels_syncJumping	0.8992
original_activity_labels_vacuumCleaning	-0.2189
original_activity_labels_walkingFast	0.1547
original_activity_labels_walkingNormal	0.4252
original_activity_labels_walkingSlow	0.6267
original_activity_labels_walkingStairsUp	0.1982
coefs = pd.DataFrame(
slr.coef_, columns=["Coefficients"], index=X_train_std.columns)
coefs
| Coefficients | |
|---|---|
| EEtot | 0.3412 |
| METS | -3.5980 |
| Rf | -0.1433 |
| VT | -1.9545 |
| VE | -7.7387 |
| O2exp | -1.0557 |
| CO2exp | 3.7042 |
| FeO2 | -0.9021 |
| FeCO2 | -1.1345 |
| VO2.HR | 1.0252 |
| HR | -0.4310 |
| gender_male | 1.6887 |
| original_activity_labels_dishwashing | -0.0310 |
| original_activity_labels_lyingDownLeft | -0.0819 |
| original_activity_labels_lyingDownRight | 0.1211 |
| original_activity_labels_sittingChair | 0.1296 |
| original_activity_labels_sittingCouch | 0.2812 |
| original_activity_labels_sittingSofa | 0.3217 |
| original_activity_labels_stakingShelves | -0.0936 |
| original_activity_labels_standing | 0.6790 |
| original_activity_labels_step | 0.5068 |
| original_activity_labels_syncJumping | 0.8992 |
| original_activity_labels_vacuumCleaning | -0.2189 |
| original_activity_labels_walkingFast | 0.1547 |
| original_activity_labels_walkingNormal | 0.4252 |
| original_activity_labels_walkingSlow | 0.6267 |
| original_activity_labels_walkingStairsUp | 0.1982 |
coefs.plot(kind="barh", figsize=(9, 7))
plt.title("MLR model")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left=0.3)
The coefficients with the largest impact (absolute magnitude) are:
- VE
- CO2exp
- METS
We use cross-validation to assess model accuracy, estimating the mean accuracy over 10 cross-validation folds.
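As a self-contained illustration of the same call on synthetic data (toy arrays, not the project's features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 10-fold CV returns one R^2 score per fold; their mean and standard
# deviation summarize accuracy and its stability.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=10)
print('mean R2: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```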
scores = cross_val_score(estimator=slr,
X=X_train_std,
y=y_train,
cv=10,
n_jobs=1)
print('CV keskmine R2 täpsus: %.3f' % np.mean(scores), "+/- %.3f" % np.std(scores))
print('R2 täpsus treeningandmetel: %.3f' % slr.score(X_train_std, y_train))
print('R2 täpsus testandmetel: %.3f' % slr.score(X_test_std, y_test))
CV keskmine R2 täpsus: 0.974 +/- 0.001
R2 täpsus treeningandmetel: 0.974
R2 täpsus testandmetel: 0.974
We compute the linear regression model's RMSE on the training data: the mean_squared_error function gives the MSE, and taking its square root yields the RMSE. This estimates how well the model predicts the training data.
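On hand-checkable toy values the computation looks like this:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE on toy values: errors are 1, 0 and 2, so MSE = (1 + 0 + 4) / 3.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(round(rmse, 3))
```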
scores = cross_val_score(estimator=slr,
X=X_train_std,
y=y_train,
scoring = 'neg_mean_squared_error',
cv=10,
n_jobs=1)
print('CV keskmine RMSE: %.3f' % np.mean(np.sqrt(np.abs(scores))), "+/- %.3f" % np.std(np.sqrt(np.abs(scores))))
print('RMSE treeningandmetel: %.3f' % np.sqrt(mean_squared_error(y_train,slr.predict(X_train_std))))
print('RMSE testandmetel: %.3f' % np.sqrt(mean_squared_error(y_test,slr.predict(X_test_std))))
CV keskmine RMSE: 1.673 +/- 0.035
RMSE treeningandmetel: 1.671
RMSE testandmetel: 1.653
Model residuals (errors):
residuals=y_train-slr.predict(X_train_std)
Standardized model residuals:
std_residuals=residuals/np.std(residuals)
Model diagnostic plots:
fig, axs = plt.subplots(2,2,dpi=95,figsize=(15,15))
plt.style.use("seaborn-v0_8-whitegrid")
# Residual against fitted values
axs[0, 0].scatter(x=slr.predict(X_train_std), y=std_residuals)
axs[0, 0].axhline(y=0, color='red', linestyle='dashed')
axs[0, 0].set_xlabel('Fitted Values')
axs[0, 0].set_ylabel('Std. Residuals')
axs[0, 0].set_title('Residuals vs Fitted')
# normal qqplot
stats.probplot(std_residuals, plot=axs[0, 1])
#sm.qqplot(std_residuals, dist=stats.t, fit=True, line='45', c='#4C72B0',ax=axs[0, 1])
axs[0, 1].set_title('Normal Q-Q')
# Fitted values against actual values
axs[1, 0].scatter(x=y_train, y=slr.predict(X_train_std))
axs[1, 0].plot(y_train, y_train, color='red', linestyle='dashed')
axs[1, 0].set_xlabel('Actual Values')
axs[1, 0].set_ylabel('Fitted Values')
axs[1, 0].set_title('Fitted vs Actual')
# Histogram of std. residuals
axs[1, 1].hist(std_residuals, density=True)
x = np.linspace(min(std_residuals),max(std_residuals), 500)
axs[1, 1].plot(x, norm.pdf(x),color='red')
axs[1, 1].set_xlabel('Std. Residuals')
axs[1, 1].set_title('Std. Residuals Density Plot')
fig.tight_layout()
plt.show()
The diagnostic plots show results similar to those of the linear regression model on the numerical features alone. However, the Residuals vs. Fitted plot shows the residuals scattered more evenly. This model is more reliable.
X = df.drop([_SIHTTUNNUS_],axis=1)
y = df[_SIHTTUNNUS_]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
We separate the categorical and numerical variables using their data types for identification. As we saw earlier, the object dtype corresponds to the categorical columns. We use make_column_selector to select the respective columns.
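A minimal sketch of make_column_selector on a toy frame:

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# dtype-based column selection: numeric columns vs. object columns.
toy = pd.DataFrame({"HR": [60, 72], "gender": ["male", "female"]})

num_cols = selector(dtype_exclude=object)(toy)
cat_cols = selector(dtype_include=object)(toy)
print(num_cols, cat_cols)
```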
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(X)
categorical_columns = categorical_columns_selector(X)
numerical_columns
['EEtot', 'METS', 'Rf', 'VT', 'VE', 'O2exp', 'CO2exp', 'FeO2', 'FeCO2', 'VO2.HR', 'HR']
categorical_columns
['gender', 'original_activity_labels']
categorical_preprocessor = OneHotEncoder(drop='first')
The preprocessor for the numerical features must also generate the polynomial terms, so we use a pipeline:
numerical_preprocessor = Pipeline([
    ('scaler', StandardScaler()),
    ('poly2', PolynomialFeatures(degree=2))
])
Now we create a ColumnTransformer and bind each preprocessor to its columns:
preprocessor = ColumnTransformer(
[
("ctg", categorical_preprocessor, categorical_columns),
("num", numerical_preprocessor, numerical_columns),
]
)
Now we build a pipeline that combines the ColumnTransformer with the model:
poly_lr = Pipeline([
('pre', preprocessor),
('lr', LinearRegression())
])
Fitting the model on the training data¶
poly_lr.fit(X_train, y_train)
Pipeline(steps=[('pre',
ColumnTransformer(transformers=[('ctg',
OneHotEncoder(drop='first'),
['gender',
'original_activity_labels']),
('num',
Pipeline(steps=[('scaler',
StandardScaler()),
('poly2',
PolynomialFeatures())]),
['EEtot', 'METS', 'Rf', 'VT',
'VE', 'O2exp', 'CO2exp',
'FeO2', 'FeCO2', 'VO2.HR',
'HR'])])),
('lr', LinearRegression())])
Evaluating the model
preprocessor.get_feature_names_out()
array(['ctg__gender_male', 'ctg__original_activity_labels_dishwashing',
'ctg__original_activity_labels_lyingDownLeft',
'ctg__original_activity_labels_lyingDownRight',
'ctg__original_activity_labels_sittingChair',
'ctg__original_activity_labels_sittingCouch',
'ctg__original_activity_labels_sittingSofa',
'ctg__original_activity_labels_stakingShelves',
'ctg__original_activity_labels_standing',
'ctg__original_activity_labels_step',
'ctg__original_activity_labels_syncJumping',
'ctg__original_activity_labels_vacuumCleaning',
'ctg__original_activity_labels_walkingFast',
'ctg__original_activity_labels_walkingNormal',
'ctg__original_activity_labels_walkingSlow',
'ctg__original_activity_labels_walkingStairsUp',
'ctg__original_activity_labels_nan', 'num__1', 'num__EEtot',
'num__METS', 'num__Rf', 'num__VT', 'num__VE', 'num__O2exp',
'num__CO2exp', 'num__FeO2', 'num__FeCO2', 'num__VO2.HR', 'num__HR',
'num__EEtot^2', 'num__EEtot METS', 'num__EEtot Rf',
'num__EEtot VT', 'num__EEtot VE', 'num__EEtot O2exp',
'num__EEtot CO2exp', 'num__EEtot FeO2', 'num__EEtot FeCO2',
'num__EEtot VO2.HR', 'num__EEtot HR', 'num__METS^2',
'num__METS Rf', 'num__METS VT', 'num__METS VE', 'num__METS O2exp',
'num__METS CO2exp', 'num__METS FeO2', 'num__METS FeCO2',
'num__METS VO2.HR', 'num__METS HR', 'num__Rf^2', 'num__Rf VT',
'num__Rf VE', 'num__Rf O2exp', 'num__Rf CO2exp', 'num__Rf FeO2',
'num__Rf FeCO2', 'num__Rf VO2.HR', 'num__Rf HR', 'num__VT^2',
'num__VT VE', 'num__VT O2exp', 'num__VT CO2exp', 'num__VT FeO2',
'num__VT FeCO2', 'num__VT VO2.HR', 'num__VT HR', 'num__VE^2',
'num__VE O2exp', 'num__VE CO2exp', 'num__VE FeO2', 'num__VE FeCO2',
'num__VE VO2.HR', 'num__VE HR', 'num__O2exp^2',
'num__O2exp CO2exp', 'num__O2exp FeO2', 'num__O2exp FeCO2',
'num__O2exp VO2.HR', 'num__O2exp HR', 'num__CO2exp^2',
'num__CO2exp FeO2', 'num__CO2exp FeCO2', 'num__CO2exp VO2.HR',
'num__CO2exp HR', 'num__FeO2^2', 'num__FeO2 FeCO2',
'num__FeO2 VO2.HR', 'num__FeO2 HR', 'num__FeCO2^2',
'num__FeCO2 VO2.HR', 'num__FeCO2 HR', 'num__VO2.HR^2',
'num__VO2.HR HR', 'num__HR^2'], dtype=object)
poly_lr.named_steps['lr'].coef_
array([ 1.11588524e+00, -3.35223584e-02, -6.45980021e-02, -1.76905759e-01,
-8.69381913e-02, -4.17115094e-02, -1.53874216e-01, 1.12142675e-01,
2.75179062e-01, -1.12167018e-01, 2.49311864e-01, -1.39661167e-03,
-6.66572688e-02, -2.50763177e-02, 1.42887814e-01, -6.78430960e-02,
-7.31760935e-02, -2.53504494e+03, 6.31088389e-02, -6.10776529e+00,
2.70223687e+08, -4.05085758e+08, -1.74074102e+08, 3.44535832e+08,
2.57149489e+08, -5.28415641e+07, -6.19166777e+07, 7.29776039e+00,
4.56303008e+00, 1.21282010e-02, -3.55187891e-01, -2.90894779e-02,
-1.11653483e+00, 5.71564049e-01, 7.31470227e-01, 3.85120749e-01,
-6.55072629e-02, -3.65646631e-02, -1.36274666e-01, -9.61393714e-02,
-7.29222342e-01, 1.59336299e-01, 2.72875926e+01, 2.32483765e+00,
-1.94251935e+01, -6.73374751e+00, -1.92198621e+00, -2.62710661e-01,
-2.22766016e+00, -1.88767031e-01, 3.10957432e-04, 4.44047038e+06,
1.36789352e-01, -1.12966873e+09, 1.54127620e+09, -3.34736034e-02,
5.02254311e-02, -3.12185787e-01, -8.24527442e-02, -3.89861885e+01,
-3.69538771e+01, 5.26733545e+01, 3.23250163e+01, 3.13555431e+07,
-2.48821917e+08, -5.35449639e-01, -7.76449083e+00, -1.18688235e+00,
2.89877477e+01, 7.32735864e+00, 1.30676841e+08, -6.95598797e+08,
1.13887857e+00, -1.41304341e+00, -1.65728529e+01, -2.49536350e+01,
-1.22697956e+01, 7.82218382e+08, -3.96077499e-01, 5.49233616e+00,
-4.65898836e+00, -2.00492064e+08, -2.04347900e-01, 5.53791635e-01,
2.22513276e+00, 2.94494927e-01, 6.73780438e-01, 6.71497986e-01,
2.81234711e-01, 2.72435844e-01, 3.55855571e-01, -2.13263795e-01,
8.28346810e-01, 5.19687173e+00, 4.30181950e-01])
coefs = pd.DataFrame(
poly_lr.named_steps['lr'].coef_, columns=["Coefficients"], index=preprocessor.get_feature_names_out())
coefs
| Coefficients | |
|---|---|
| ctg__gender_male | 1.1159 |
| ctg__original_activity_labels_dishwashing | -0.0335 |
| ctg__original_activity_labels_lyingDownLeft | -0.0646 |
| ctg__original_activity_labels_lyingDownRight | -0.1769 |
| ctg__original_activity_labels_sittingChair | -0.0869 |
| ... | ... |
| num__FeCO2 VO2.HR | 0.3559 |
| num__FeCO2 HR | -0.2133 |
| num__VO2.HR^2 | 0.8283 |
| num__VO2.HR HR | 5.1969 |
| num__HR^2 | 0.4302 |
95 rows × 1 columns
coefs.plot(kind="barh", figsize=(9, 12))
plt.title("Poly model")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left=0.3)
The categorical features have lost importance in this model compared with the linear regression model with categorical features.
scores = cross_val_score(estimator=poly_lr,
X=X_train,
y=y_train,
cv=10,
n_jobs=1)
print('CV keskmine R2 täpsus: %.3f' % np.mean(scores), "+/- %.3f" % np.std(scores))
print('Keskmine R2 täpsus treeningandmetel: %.3f' % poly_lr.score(X_train, y_train))
print('Keskmine R2 täpsus testandmetel: %.3f' % poly_lr.score(X_test, y_test))
CV keskmine R2 täpsus: 0.984 +/- 0.001
Keskmine R2 täpsus treeningandmetel: 0.985
Keskmine R2 täpsus testandmetel: 0.984
Computing the model's RMSE:
scores = cross_val_score(estimator=poly_lr,
X=X_train,
y=y_train,
scoring = 'neg_mean_squared_error',
cv=10,
n_jobs=1)
print('CV keskmine RMSE: %.3f' % np.mean(np.sqrt(np.abs(scores))), "+/- %.3f" % np.std(np.sqrt(np.abs(scores))))
print('RMSE treeningandmetel: %.3f' % np.sqrt(mean_squared_error(y_train,poly_lr.predict(X_train))))
print('RMSE testandmetel: %.3f' % np.sqrt(mean_squared_error(y_test,poly_lr.predict(X_test))))
CV keskmine RMSE: 1.309 +/- 0.029
RMSE treeningandmetel: 1.300
RMSE testandmetel: 1.290
Model residuals (errors):
residuals=y_train-poly_lr.predict(X_train)
Standardized model residuals:
std_residuals=residuals/np.std(residuals)
Model diagnostic plots:
fig, axs = plt.subplots(2,2,dpi=95,figsize=(15,15))
# plt.style.use('seaborn')
# Residual against fitted values
axs[0, 0].scatter(x=poly_lr.predict(X_train), y=std_residuals)
axs[0, 0].axhline(y=0, color='red', linestyle='dashed')
axs[0, 0].set_xlabel('Fitted Values')
axs[0, 0].set_ylabel('Std. Residuals')
axs[0, 0].set_title('Residuals vs Fitted')
# normal qqplot
stats.probplot(std_residuals, plot=axs[0, 1])
#sm.qqplot(std_residuals, dist=stats.t, fit=True, line='45', c='#4C72B0',ax=axs[0, 1])
axs[0, 1].set_title('Normal Q-Q')
# Fitted values against actual values
axs[1, 0].scatter(x=y_train, y=poly_lr.predict(X_train))
axs[1, 0].plot(y_train, y_train, color='red', linestyle='dashed')
axs[1, 0].set_xlabel('Actual Values')
axs[1, 0].set_ylabel('Fitted Values')
axs[1, 0].set_title('Fitted vs Actual')
# Histogram of std. residuals
axs[1, 1].hist(std_residuals, density=True)
x = np.linspace(min(std_residuals),max(std_residuals), 500)
axs[1, 1].plot(x, norm.pdf(x),color='red')
axs[1, 1].set_xlabel('Std. Residuals')
axs[1, 1].set_title('Std. Residuals Density Plot')
fig.tight_layout()
plt.show()
The residual diagnostics show better results in the Residuals vs. Fitted plot than the plain linear regression model. This appears to be the best model so far.
Validation curve for the polynomial regression degree¶
degrees = range(1,4)
a = []
c = []
for deg in degrees:
numerical_preprocessor = Pipeline([
('scaler', StandardScaler()),
('poly2', PolynomialFeatures(degree=deg))])
preprocessor = ColumnTransformer(
[
("ctg", categorical_preprocessor, categorical_columns),
("num", numerical_preprocessor, numerical_columns),
])
poly_lr = Pipeline([
('pre', preprocessor),
('lr', LinearRegression())])
poly_lr.fit(X_train, y_train)
cv_models = cross_validate(estimator=poly_lr,
X=X_train,
y=y_train,
return_estimator=True,
cv=10,
n_jobs=1)
cv_fit = cv_models['estimator']
c.append(r2_score(np.exp(y_train),np.exp(poly_lr.predict(X_train))))
b = []
for i in range(len(cv_fit)):
b.append(r2_score(np.exp(y_test),np.exp(cv_fit[i].predict(X_test))))
a.append(np.mean(b))
plt.figure(figsize=(6, 4))
plt.plot(degrees, a, lw=2,
label='cross-validation test')
plt.plot(degrees, c, lw=2, label='train')
plt.legend(loc='best')
plt.xlabel('degree')
plt.ylabel('R2')
plt.title('Validation curve')
plt.tight_layout()
The validation curves for the polynomial degree (on the cross-validation test data and on the training data) are not parallel, and each has a sharp bend in the middle at a different angle. The cross-validation test line (blue) drops on the R2 scale from around zero to about -6 and then continues roughly horizontally. The training line also drops, to about R2 = -8, and then bends sharply upward, at roughly 90 degrees to its previous slope. (Note that R2 is computed here on np.exp back-transformed values; if the target was not modeled on a log scale, this back-transform alone could produce such large negative R2 values.)
Such behavior calls for further investigation and modeling in order to improve predictive power and avoid overfitting. The model may be too complex, which can cause overfitting and poor generalization to new data. Possible remedies include simplifying the model, regularization (e.g. Ridge or Lasso), or considering other, simpler models.
Each polynomial degree corresponds to a point on the curve, showing how model performance (the metric on the Y axis) changes as the degree grows. Ideally, you want to find the polynomial degree that yields the best performance, i.e. where the Y-axis value is maximal.
If the curves contain break points or spikes, this may indicate that model performance changes dramatically as the polynomial degree increases. Such points can be important when searching for the optimal model complexity.
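As a hedged sketch of the regularization idea mentioned above, LinearRegression can be swapped for Ridge inside the same pipeline shape (toy data; the alpha value is an assumption and would still need tuning, e.g. with GridSearchCV):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same pipeline shape as the polynomial model, but with an L2-penalized
# estimator; the penalty shrinks the coefficients of redundant terms.
ridge_poly = Pipeline([
    ('scaler', StandardScaler()),
    ('poly2', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=1.0)),
])

# Toy data with a quadratic relationship in the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = X_toy[:, 0] ** 2 + rng.normal(scale=0.1, size=100)

ridge_poly.fit(X_toy, y_toy)
print(round(ridge_poly.score(X_toy, y_toy), 3))
```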
X = df.drop([_SIHTTUNNUS_],axis=1)
y = df[_SIHTTUNNUS_]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
We separate the categorical and numerical variables using their data types. As we saw earlier, the object dtype corresponds to the categorical columns. We use make_column_selector to select the appropriate columns.
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(X)
categorical_columns = categorical_columns_selector(X)
numerical_columns
['EEtot', 'METS', 'Rf', 'VT', 'VE', 'O2exp', 'CO2exp', 'FeO2', 'FeCO2', 'VO2.HR', 'HR']
categorical_columns
['gender', 'original_activity_labels']
Numerical and categorical data must be prepared for modelling differently: categorical data are one-hot encoded (feature values replaced by indicator features), while numerical data are standardized/normalized. Scikit-learn provides the ColumnTransformer class, which splits the pipeline into parts, routing specific columns to specific transformations. This lets both kinds of variables be combined in one pipeline.
categorical_preprocessor = OneHotEncoder(drop='first')
Preprocessor for the numerical features:
numerical_preprocessor = StandardScaler()
Now we create the ColumnTransformer and associate each preprocessor with its columns:
preprocessor = ColumnTransformer(
[
("ctg", categorical_preprocessor, categorical_columns),
("num", numerical_preprocessor, numerical_columns),
]
)
tree_pipe = Pipeline([
('pre', preprocessor),
('tree', DecisionTreeRegressor(random_state=0))])
tree_pipe.fit(X_train, y_train)
Pipeline(steps=[('pre',
ColumnTransformer(transformers=[('ctg',
OneHotEncoder(drop='first'),
['gender',
'original_activity_labels']),
('num', StandardScaler(),
['EEtot', 'METS', 'Rf', 'VT',
'VE', 'O2exp', 'CO2exp',
'FeO2', 'FeCO2', 'VO2.HR',
'HR'])])),
                ('tree', DecisionTreeRegressor(random_state=0))])
print('R2 on training data: %.3f' % tree_pipe.score(X_train, y_train))
print('R2 on test data: %.3f' % tree_pipe.score(X_test, y_test))
R2 on training data: 1.000
R2 on test data: 0.975
RMSE before the inverse transformation exp():
mse = mean_squared_error(y_train, tree_pipe.predict(X_train))
print(f"RMSE on training data: {np.sqrt(mse):.3f}")
y_pred = tree_pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"RMSE on test data: {rmse:.3f}")
RMSE on training data: 0.000
RMSE on test data: 1.623
We apply the inverse transformation exp():
print('R2 on training data: %.3f' % r2_score(np.exp(y_train), np.exp(tree_pipe.predict(X_train))))
print('R2 on test data: %.3f' % r2_score(np.exp(y_test), np.exp(tree_pipe.predict(X_test))))
R2 on training data: 1.000
R2 on test data: 0.778
RMSE after the inverse transformation exp():
y_pred = np.exp(tree_pipe.predict(X_test))
mse = mean_squared_error(np.exp(y_test), y_pred)
rmse = np.sqrt(mse)
print(f"RMSE on test data: {rmse:.3f}")
RMSE on test data: 123540818094081495006943483507626838327296.000
The enormous RMSE indicates that a few large prediction errors on the log scale become astronomical after exponentiation; either the model is overfitted, or the exponential inverse transformation is not appropriate for the decision tree model on this dataset.
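The magnitude of such an RMSE blow-up is easy to reproduce with made-up numbers: a single large prediction error on the log scale becomes astronomical after exp() and dominates the squared error. This is a toy illustration, not the project data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy log-scale targets and predictions; the last prediction is wildly off.
y_true_log = np.array([2.0, 2.5, 3.0, 2.8])
y_pred_log = np.array([2.1, 2.4, 3.1, 95.0])

rmse_log = np.sqrt(mean_squared_error(y_true_log, y_pred_log))
rmse_exp = np.sqrt(mean_squared_error(np.exp(y_true_log), np.exp(y_pred_log)))
print(f"RMSE on the log scale: {rmse_log:.3f}")
print(f"RMSE after exp():      {rmse_exp:.3e}")  # dominated by exp(95) ~ 1.8e41
```

One outlier of this kind on the log scale is enough to produce back-transformed RMSE values on the order of 1e41.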
The most important features of the model:
imp = pd.DataFrame(tree_pipe[1].feature_importances_)
ft = pd.DataFrame(preprocessor.get_feature_names_out())
ft_imp = pd.concat([ft,imp],axis=1)
ft_imp.columns = ['Feature', 'Importance']
ft_imp.sort_values(by='Importance',ascending=False)
| | Feature | Importance |
|---|---|---|
| 21 | num__VE | 0.9461 |
| 0 | ctg__gender_male | 0.0187 |
| 27 | num__HR | 0.0094 |
| 18 | num__METS | 0.0062 |
| 26 | num__VO2.HR | 0.0040 |
| 25 | num__FeCO2 | 0.0039 |
| 23 | num__CO2exp | 0.0025 |
| 24 | num__FeO2 | 0.0024 |
| 17 | num__EEtot | 0.0022 |
| 19 | num__Rf | 0.0021 |
| 22 | num__O2exp | 0.0009 |
| 20 | num__VT | 0.0006 |
| 16 | ctg__original_activity_labels_nan | 0.0002 |
| 2 | ctg__original_activity_labels_lyingDownLeft | 0.0001 |
| 12 | ctg__original_activity_labels_walkingFast | 0.0001 |
| 13 | ctg__original_activity_labels_walkingNormal | 0.0001 |
| 11 | ctg__original_activity_labels_vacuumCleaning | 0.0001 |
| 14 | ctg__original_activity_labels_walkingSlow | 0.0001 |
| 8 | ctg__original_activity_labels_standing | 0.0000 |
| 7 | ctg__original_activity_labels_stakingShelves | 0.0000 |
| 3 | ctg__original_activity_labels_lyingDownRight | 0.0000 |
| 1 | ctg__original_activity_labels_dishwashing | 0.0000 |
| 6 | ctg__original_activity_labels_sittingSofa | 0.0000 |
| 4 | ctg__original_activity_labels_sittingChair | 0.0000 |
| 9 | ctg__original_activity_labels_step | 0.0000 |
| 5 | ctg__original_activity_labels_sittingCouch | 0.0000 |
| 15 | ctg__original_activity_labels_walkingStairsUp | 0.0000 |
| 10 | ctg__original_activity_labels_syncJumping | 0.0000 |
from sklearn.model_selection import GridSearchCV
parameters={"tree__splitter":["best","random"],
"tree__max_depth" : [1,3,5,7,9],
"tree__min_samples_leaf":[1,2,3,4,5,6,7],
"tree__max_features":["log2","sqrt",None],
"tree__max_leaf_nodes":[None,10,20,30] }
gs_tree_pipe = GridSearchCV(estimator=tree_pipe, param_grid=parameters, cv=5, verbose=0)
gs_tree_pipe.fit(X_train, y_train)
gs_tree_pipe.best_params_
{'tree__max_depth': 9,
'tree__max_features': None,
'tree__max_leaf_nodes': None,
'tree__min_samples_leaf': 7,
'tree__splitter': 'best'}
Model accuracy before applying the inverse transformation:
print(f"R2 score on train: {gs_tree_pipe.score(X_train, y_train):.3f}")
print(f"R2 score on test: {gs_tree_pipe.score(X_test, y_test):.3f}")
R2 score on train: 0.984
R2 score on test: 0.978
RMSE before the inverse transformation exp():
mse = mean_squared_error(y_train, gs_tree_pipe.predict(X_train))
print(f"RMSE on training data: {np.sqrt(mse):.3f}")
y_pred = gs_tree_pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"RMSE on test data: {rmse:.3f}")
RMSE on training data: 1.330
RMSE on test data: 1.537
Model accuracy after applying the inverse transformation exp():
print(f"R2 score on train: {r2_score(np.exp(y_train),np.exp(gs_tree_pipe.predict(X_train))):.3f}")
print(f"R2 score on test: {r2_score(np.exp(y_test),np.exp(gs_tree_pipe.predict(X_test))):.3f}")
R2 score on train: 0.942
R2 score on test: 0.908
RMSE after the inverse transformation exp():
y_pred = np.exp(gs_tree_pipe.predict(X_test))
mse = mean_squared_error(np.exp(y_test), y_pred)
rmse = np.sqrt(mse)
print(f"RMSE on test data: {rmse:.3f}")
RMSE on test data: 79616768612770933240036264162054455164928.000
As before, either the model is overfitted, or the exponential inverse transformation is not appropriate for the decision tree model on this dataset.
We create a Pipeline that combines the data preprocessing with a Random Forest regression model, then train it on the training data.
rf_pipe = Pipeline([
('pre', preprocessor),
('rf', RandomForestRegressor(random_state=0))])
rf_pipe.fit(X_train, y_train)
Pipeline(steps=[('pre',
ColumnTransformer(transformers=[('ctg',
OneHotEncoder(drop='first'),
['gender',
'original_activity_labels']),
('num', StandardScaler(),
['EEtot', 'METS', 'Rf', 'VT',
'VE', 'O2exp', 'CO2exp',
'FeO2', 'FeCO2', 'VO2.HR',
'HR'])])),
                ('rf', RandomForestRegressor(random_state=0))])
print('R2 on training data: %.3f' % rf_pipe.score(X_train, y_train))
print('R2 on test data: %.3f' % rf_pipe.score(X_test, y_test))
R2 on training data: 0.998
R2 on test data: 0.989
RMSE:
mse = mean_squared_error(y_train, rf_pipe.predict(X_train))
print(f"Random Forest RMSE on training data: {np.sqrt(mse):.3f}")
y_pred = rf_pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Random Forest RMSE on test data: {rmse:.3f}")
Random Forest RMSE on training data: 0.457
Random Forest RMSE on test data: 1.191
We apply the inverse transformation exp() to evaluate the model on the original scale:
print('R2 on training data: %.3f' % r2_score(np.exp(y_train), np.exp(rf_pipe.predict(X_train))))
print('R2 on test data: %.3f' % r2_score(np.exp(y_test), np.exp(rf_pipe.predict(X_test))))
R2 on training data: 0.987
R2 on test data: 0.896
The most important features of the model:
imp = pd.DataFrame(rf_pipe[1].feature_importances_)
ft = pd.DataFrame(preprocessor.get_feature_names_out())
ft_imp = pd.concat([ft,imp],axis=1)
ft_imp.columns = ['Feature', 'Importance']
ft_imp.sort_values(by='Importance',ascending=False)
| | Feature | Importance |
|---|---|---|
| 21 | num__VE | 0.9461 |
| 0 | ctg__gender_male | 0.0186 |
| 27 | num__HR | 0.0076 |
| 18 | num__METS | 0.0063 |
| 26 | num__VO2.HR | 0.0055 |
| 25 | num__FeCO2 | 0.0036 |
| 19 | num__Rf | 0.0029 |
| 24 | num__FeO2 | 0.0026 |
| 17 | num__EEtot | 0.0022 |
| 23 | num__CO2exp | 0.0019 |
| 22 | num__O2exp | 0.0010 |
| 20 | num__VT | 0.0008 |
| 16 | ctg__original_activity_labels_nan | 0.0002 |
| 11 | ctg__original_activity_labels_vacuumCleaning | 0.0001 |
| 2 | ctg__original_activity_labels_lyingDownLeft | 0.0001 |
| 13 | ctg__original_activity_labels_walkingNormal | 0.0001 |
| 12 | ctg__original_activity_labels_walkingFast | 0.0001 |
| 14 | ctg__original_activity_labels_walkingSlow | 0.0001 |
| 8 | ctg__original_activity_labels_standing | 0.0001 |
| 7 | ctg__original_activity_labels_stakingShelves | 0.0001 |
| 1 | ctg__original_activity_labels_dishwashing | 0.0000 |
| 4 | ctg__original_activity_labels_sittingChair | 0.0000 |
| 3 | ctg__original_activity_labels_lyingDownRight | 0.0000 |
| 6 | ctg__original_activity_labels_sittingSofa | 0.0000 |
| 9 | ctg__original_activity_labels_step | 0.0000 |
| 5 | ctg__original_activity_labels_sittingCouch | 0.0000 |
| 15 | ctg__original_activity_labels_walkingStairsUp | 0.0000 |
| 10 | ctg__original_activity_labels_syncJumping | 0.0000 |
Using GridSearchCV allows optimizing the Random Forest hyperparameters and is expected to yield higher accuracy. We tune the parameters with GridSearchCV:
# param_grid_rf = {
# 'rf__n_estimators': [10, 50, 100, 500, 1000],
# 'rf__max_features': ['log2', 'sqrt', 0.8,1]
# }
param_grid_rf = {
'rf__n_estimators': [10, 50, 100],
'rf__max_features': ['log2', 'sqrt', 0.8,1]
}
gs_rf_pipe = GridSearchCV(estimator=rf_pipe, param_grid=param_grid_rf, cv=5, verbose=0)
gs_rf_pipe.fit(X_train, y_train)
gs_rf_pipe.best_params_
{'rf__max_features': 0.8, 'rf__n_estimators': 100}
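Besides best_params_, GridSearchCV also records the per-candidate cross-validation scores in its cv_results_ attribute, which helps judge how close the runner-up settings were. A small self-contained sketch on synthetic data (the data and the tiny grid here are illustrative, not the project's):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem and a deliberately tiny grid.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10,
                                 random_state=0)
gs = GridSearchCV(RandomForestRegressor(random_state=0),
                  {'n_estimators': [10, 50], 'max_features': ['sqrt', 1.0]},
                  cv=3)
gs.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes ranking candidates easy.
res = pd.DataFrame(gs.cv_results_)[['param_n_estimators', 'param_max_features',
                                    'mean_test_score', 'rank_test_score']]
print(res.sort_values('rank_test_score'))
```

Sorting by rank_test_score shows whether the best candidate clearly beats the alternatives or merely edges them out.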
Model accuracy before applying the inverse transformation:
print(f"R2 score on train: {gs_rf_pipe.score(X_train, y_train):.3f}")
print(f"R2 score on test: {gs_rf_pipe.score(X_test, y_test):.3f}")
R2 score on train: 0.999
R2 score on test: 0.989
Model RMSE:
mse = mean_squared_error(y_train, gs_rf_pipe.predict(X_train))
print(f"Random Forest RMSE on training data: {np.sqrt(mse):.3f}")
y_pred = gs_rf_pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Random Forest RMSE on test data: {rmse:.3f}")
Random Forest RMSE on training data: 0.400
Random Forest RMSE on test data: 1.058
Model accuracy after applying the inverse transformation exp():
print(f"R2 score on train: {r2_score(np.exp(y_train),np.exp(gs_rf_pipe.predict(X_train))):.3f}")
print(f"R2 score on test: {r2_score(np.exp(y_test),np.exp(gs_rf_pipe.predict(X_test))):.3f}")
R2 score on train: 0.986
R2 score on test: 0.910
The model's explanatory power is almost unchanged.
RMSE after GridSearchCV tuning and after the inverse transformation exp():
y_pred = np.exp(gs_rf_pipe.predict(X_test))
mse = mean_squared_error(np.exp(y_test), y_pred)
rmse = np.sqrt(mse)
print(f"Random Forest RMSE on test data: {rmse:.3f}")
Random Forest RMSE on test data: 256069566736770598900636151744551404437504.000
Again, either the model is overfitted, or the exponential inverse transformation is not appropriate for the Random Forest model on this dataset.
The model's features:
preprocessor.get_feature_names_out()
array(['ctg__gender_male', 'ctg__original_activity_labels_dishwashing',
'ctg__original_activity_labels_lyingDownLeft',
'ctg__original_activity_labels_lyingDownRight',
'ctg__original_activity_labels_sittingChair',
'ctg__original_activity_labels_sittingCouch',
'ctg__original_activity_labels_sittingSofa',
'ctg__original_activity_labels_stakingShelves',
'ctg__original_activity_labels_standing',
'ctg__original_activity_labels_step',
'ctg__original_activity_labels_syncJumping',
'ctg__original_activity_labels_vacuumCleaning',
'ctg__original_activity_labels_walkingFast',
'ctg__original_activity_labels_walkingNormal',
'ctg__original_activity_labels_walkingSlow',
'ctg__original_activity_labels_walkingStairsUp',
'ctg__original_activity_labels_nan', 'num__EEtot', 'num__METS',
'num__Rf', 'num__VT', 'num__VE', 'num__O2exp', 'num__CO2exp',
'num__FeO2', 'num__FeCO2', 'num__VO2.HR', 'num__HR'], dtype=object)
Feature importances of the model retrained with max_features='log2':
# rf_pipe = Pipeline([
# ('pre', preprocessor),
# ('rf',RandomForestRegressor(max_features='log2', n_estimators= 1000))])
rf_pipe = Pipeline([
('pre', preprocessor),
('rf',RandomForestRegressor(max_features='log2', n_estimators= 100))])
rf_pipe.fit(X_train, y_train)
Pipeline(steps=[('pre',
ColumnTransformer(transformers=[('ctg',
OneHotEncoder(drop='first'),
['gender',
'original_activity_labels']),
('num', StandardScaler(),
['EEtot', 'METS', 'Rf', 'VT',
'VE', 'O2exp', 'CO2exp',
'FeO2', 'FeCO2', 'VO2.HR',
'HR'])])),
                ('rf', RandomForestRegressor(max_features='log2'))])
imp = pd.DataFrame(rf_pipe[1].feature_importances_)
ft = pd.DataFrame(preprocessor.get_feature_names_out())
ft_imp = pd.concat([ft,imp],axis=1)
ft_imp.columns = ['Feature', 'Importance']
ft_imp.sort_values(by='Importance',ascending=False)
| | Feature | Importance |
|---|---|---|
| 21 | num__VE | 0.2356 |
| 18 | num__METS | 0.1878 |
| 22 | num__O2exp | 0.1071 |
| 20 | num__VT | 0.0865 |
| 19 | num__Rf | 0.0792 |
| 26 | num__VO2.HR | 0.0774 |
| 27 | num__HR | 0.0670 |
| 23 | num__CO2exp | 0.0621 |
| 17 | num__EEtot | 0.0364 |
| 24 | num__FeO2 | 0.0198 |
| 25 | num__FeCO2 | 0.0174 |
| 0 | ctg__gender_male | 0.0077 |
| 12 | ctg__original_activity_labels_walkingFast | 0.0047 |
| 16 | ctg__original_activity_labels_nan | 0.0027 |
| 5 | ctg__original_activity_labels_sittingCouch | 0.0016 |
| 13 | ctg__original_activity_labels_walkingNormal | 0.0015 |
| 3 | ctg__original_activity_labels_lyingDownRight | 0.0012 |
| 11 | ctg__original_activity_labels_vacuumCleaning | 0.0008 |
| 6 | ctg__original_activity_labels_sittingSofa | 0.0007 |
| 4 | ctg__original_activity_labels_sittingChair | 0.0007 |
| 14 | ctg__original_activity_labels_walkingSlow | 0.0006 |
| 8 | ctg__original_activity_labels_standing | 0.0005 |
| 1 | ctg__original_activity_labels_dishwashing | 0.0004 |
| 2 | ctg__original_activity_labels_lyingDownLeft | 0.0003 |
| 7 | ctg__original_activity_labels_stakingShelves | 0.0003 |
| 9 | ctg__original_activity_labels_step | 0.0001 |
| 15 | ctg__original_activity_labels_walkingStairsUp | 0.0000 |
| 10 | ctg__original_activity_labels_syncJumping | 0.0000 |
After retraining the Random Forest with max_features='log2' (restricting the number of features considered at each split spreads the importance across correlated features), the importance values have changed: the importance of num__VE has dropped roughly four-fold (from 0.946 to 0.236). The most important features:
- 21 num__VE
- 18 num__METS
- 22 num__O2exp
- 20 num__VT
- 19 num__Rf
Analysis of results¶
Comparison of the results by RMSE and R2 (training and test data)
| Method | R2 train | R2 test | RMSE train | RMSE test |
|---|---|---|---|---|
| Linear regression with numerical features only | 0.970 | 0.970 | 1.807 | 1.786 |
| Linear regression including categorical features | 0.974 | 0.974 | 1.671 | 1.653 |
| Polynomial regression | 0.985 | 0.984 | 1.309 | 1.300 |
| Decision tree regression | 1.000 | 0.975 | 0.000 | 1.623 |
| Decision tree regression with *GridSearchCV* tuning | 0.984 | 0.978 | 1.330 | 1.537 |
| Random Forest regression | 0.998 | 0.989 | 0.457 | 1.191 |
| Random Forest regression with *GridSearchCV* tuning | 0.998 | 0.989 | 0.400 | 1.058 |
Summary¶
Several methods were used for the data analysis in this work: Linear Regression, Polynomial Regression, Decision Tree Regression and Random Forest Regression.
Despite the similar explanatory power of these models, the results differ:
- The Random Forest regression model achieved the highest accuracy on the test data (R2 = 0.989) and the test RMSE closest to zero (1.058).
- The accuracy gap between linear regression and Random Forest is about 1.5%. Linear regression assumes a linear relationship between the features and the target, whereas Random Forest builds an ensemble of many decision trees.
- The untuned decision tree model showed a suspiciously perfect fit on the training data (R2 = 1.000, RMSE = 0.000), while its results on the test data were more realistic.
- The polynomial regression model and the GridSearchCV-tuned decision tree model produced similar results.
Of all the models compared, the Random Forest model is the best suited to predicting this dataset's target, BR (breathing rate). It may be better adapted to data that contain complex, varied patterns which a linear model cannot easily capture.
Decision trees can model non-linear relationships, but a single tree can struggle with certain kinds of complexity in a dataset.
Random Forest models are usually more complex than single decision trees, because they consist of many trees. This lets them adapt to more complex datasets and can yield better performance than individual decision trees.
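This contrast can be sketched on synthetic data (the data, seed and noise level are illustrative assumptions): a fully grown single tree memorizes noise, while averaging 100 trees smooths it out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Noisy non-linear data: y = x * sin(x) + Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(400, 1))
y_demo = X_demo[:, 0] * np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=400)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(Xtr, ytr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xtr, ytr)
print(f"single tree   R2 on test: {tree.score(Xte, yte):.3f}")
print(f"random forest R2 on test: {forest.score(Xte, yte):.3f}")
```

On runs like this the ensemble typically scores noticeably higher on the test data, mirroring the pattern seen in the comparison table above.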
The similarity of the accuracy results of linear and polynomial regression suggests that both capture the same underlying patterns in the data.
The Fitted vs. Residuals scatter plot of the linear regression showed that the residuals cluster into a distinct shape, suggesting that the model may not adequately capture some underlying structure in the dataset. This can have several implications:
- Non-linearity: a patterned shape in the residuals can indicate a non-linear relationship between the predictors and the target. The exploratory analysis found few linear relationships between the features, so the model may not be fully accurate.
- Most of the variables show no relationship with one another.
- Multicollinearity, which the dataset's correlation matrix makes quite plausible.
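The Fitted vs. Residuals diagnostic referred to above can be reproduced with a small helper; `model`, `X` and `y` stand for any fitted regressor and held-out data, and the function name is ours, not from the original notebook.

```python
import matplotlib.pyplot as plt

def plot_fitted_vs_residuals(model, X, y):
    """Scatter the residuals (y - prediction) against the fitted values."""
    fitted = model.predict(X)
    residuals = y - fitted
    plt.figure(figsize=(6, 4))
    plt.scatter(fitted, residuals, alpha=0.4)
    plt.axhline(0, color='red', lw=1)   # reference line at zero residual
    plt.xlabel('Fitted values')
    plt.ylabel('Residuals')
    plt.title('Fitted vs. Residuals')
    plt.tight_layout()
```

A structureless band around zero supports the linearity assumption; curvature suggests non-linearity, and a funnel shape suggests heteroscedasticity.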
In summary, the analysis shows that certain models, such as Random Forest and the interaction model, outperformed the other models in accuracy.